Conversation

@mergennachin mergennachin commented Jul 31, 2025

New Class: TorchExportableModuleForImageTextLM

Dedicated wrapper for image-text language models:

  • Purpose: Handles multimodal models that need inputs_embeds instead of input_ids
  • Architecture: Automatically chooses HybridCache vs StaticCache based on the model config (see the sketch after this list)
  • Usage: Takes the embeddings produced by the vision encoder and the text embedding layer as input
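
A minimal sketch of what that cache selection could look like (the attribute checked and the constructor arguments are assumptions for illustration, not the exact implementation in this PR):

  # Hypothetical sketch: pick the cache wrapper from the text config
  from transformers.integrations.executorch import (
      TorchExportableModuleWithHybridCache,
      TorchExportableModuleWithStaticCache,
  )

  text_config = model.config.text_config
  if getattr(text_config, "sliding_window", None) is not None:
      # Sliding-window layers (e.g. Gemma-3) need a hybrid cache
      wrapper = TorchExportableModuleWithHybridCache(model.language_model)
  else:
      wrapper = TorchExportableModuleWithStaticCache(model.language_model)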

New Class: ImageEncoderExportableModule

Wrapper for vision encoder components:

  • Purpose: Exports the vision processing pipeline (vision_tower → multi_modal_projector)
  • Function: Converts images into language-model-compatible embeddings (see the sketch after this list)
  • Integration: Works with TorchExportableModuleForImageTextLM for complete multimodal export
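
A minimal sketch of the wrapper, assuming the vision_tower → multi_modal_projector pipeline described above (the exact class in the PR may differ):

  import torch

  class ImageEncoderExportableModule(torch.nn.Module):
      """Wraps a multimodal model's vision pipeline for torch.export."""

      def __init__(self, model):
          super().__init__()
          self.model = model

      def forward(self, pixel_values):
          # Vision backbone → last hidden state → projection into the LM embedding space
          vision_outputs = self.model.vision_tower(pixel_values=pixel_values).last_hidden_state
          return self.model.multi_modal_projector(vision_outputs)
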
  Multimodal Model Export:

  # Vision encoder export
  vision_encoder = ImageEncoderExportableModule(model)
  exported_vision = vision_encoder.export()

  # Text decoder export  
  text_decoder = TorchExportableModuleForImageTextLM(model.language_model)
  exported_text = text_decoder.export()

  Runtime Usage:

  # Process image → embeddings
  image_embeddings = exported_vision.module()(pixel_values)

  # Process text → embeddings (via the model's input embedding layer)
  text_embeddings = model.get_input_embeddings()(text_ids)

  # Combined inference
  inputs_embeds = torch.cat([image_embeddings, text_embeddings], dim=1)
  logits = exported_text.module()(inputs_embeds=inputs_embeds, cache_position=cache_position)

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: moshi

@mergennachin force-pushed the add-multimodal-executorch-support branch 5 times, most recently from 14ae06d to 70e366e on July 31, 2025 at 21:35

This commit enhances the ExecuTorch integration to support multimodal models like Gemma-3, LLaVA, and other vision-language models.

Key changes:
- Enhanced TorchExportableModuleWithHybridCache to support inputs_embeds parameter and multimodal configs
- Added TorchExportableModuleForImageTextLM for image-text language models
- Added ImageEncoderExportableModule for vision encoders
- Added a test for multimodal functionality

This enables ExecuTorch export for vision-language models while maintaining backward compatibility with text-only models.

@mergennachin force-pushed the add-multimodal-executorch-support branch from 162df79 to ff1ac47 on July 31, 2025 at 21:57
@zucchini-nlp (Member) left a comment


Hey, thanks a lot for the PR! I agree that we need to export the LM and vision backbones separately and handle input merging manually. Left a few comments; imo we should make sure different types of multimodal architectures are exportable (i.e. expected inputs, config attribute names).

Comment on lines +899 to +900
if not hasattr(model.config, "text_config") or not hasattr(model.config.text_config, "use_cache") or model.config.text_config.use_cache is False:
    raise ValueError("The model must have caching enabled to be performant.")

model.get_text_config() is more reliable because it is not always called text_config. And since it's accessed a lot below, we can just save it in self.text_config = model.get_text_config()
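
A sketch of that suggestion, following the reviewer's wording (not code from the PR):

self.text_config = model.get_text_config()  # robust even when the attribute is not literally named `text_config`
if getattr(self.text_config, "use_cache", False) is False:
    raise ValueError("The model must have caching enabled to be performant.")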

# This is the same as sdpa, but mask creation does not use `vmap` which is not exportable
ALL_MASK_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", sdpa_mask_without_vmap)
ALL_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", ALL_ATTENTION_FUNCTIONS["sdpa"])
self.model.model.config._attn_implementation = "sdpa_without_vmap"

Let's use public API - model.set_attn_implementation("sdpa_without_vmap")

Comment on lines +954 to +963
if hasattr(self.model, "base_model_prefix"):
    base = getattr(self.model, self.model.base_model_prefix, self.model)
    model_device = base.device
elif hasattr(self.model, "model"):
    model_device = self.model.model.device
else:
    model_device = "cpu"
    logging.warning(
        "TorchExportableModuleForImageTextLM.export Can't infer device from the model. Set to CPU by default."
    )

hmm, I think model.device would be fine. The model here is the language backbone

        super().__init__()
        self.model = model

    def forward(self, pixel_values):

most models currently require extra inputs such as num_patches, image_attn_mask etc.

Comment on lines +1014 to +1016
vision_outputs = self.model.vision_tower(pixel_values=pixel_values).last_hidden_state
image_features = self.model.multi_modal_projector(vision_outputs)
return image_features

I guess self.model is the multimodal model. We should use model.get_image_features(), which handles the pipeline correctly for the given model, because some models might need extra ops on top of this.
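
A sketch of that suggestion applied to the lines above (not code from the PR):

# Let the model assemble its own vision pipeline, including any model-specific extra ops
image_features = self.model.get_image_features(pixel_values=pixel_values)
return image_features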

    return causal_mask


class TorchExportableModuleForImageTextLM(torch.nn.Module):

I feel like this is the same as TorchExportableModuleForDecoderOnlyLM, with the only difference that the input model is multimodal. We could re-use TorchExportableModuleForDecoderOnlyLM and ask users to export the language backbone explicitly, like TorchExportableModuleForDecoderOnlyLM(model.language_model).
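
For example, a sketch of that suggestion (not code from the PR):

from transformers.integrations.executorch import TorchExportableModuleForDecoderOnlyLM

# Export only the language backbone of the multimodal model with the existing wrapper
exportable_lm = TorchExportableModuleForDecoderOnlyLM(model.language_model)
exported_text = exportable_lm.export()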

@mergennachin (Author)

Hey @zucchini-nlp

Thanks a lot for the thoughtful reviews.

@jackzhxng will take this over the finish line in #39836

I'm gonna close this PR for the time being, but hope @jackzhxng can incorporate some of your suggestions and recommendations in that PR.
